Item Response Theory Modeling for Microarray Gene Expression Data
Abstract
The high dimensionality of global gene expression profiles, where the number of variables (genes) is very large compared to the number of observations (samples), presents challenges that affect the generalizability and applicability of microarray analysis. Latent variable modeling offers a promising approach to high-dimensional microarray data: the model rests on a few latent variables that capture most of the gene expression information. Here, we describe how a latent variable methodology achieves dimension reduction and greatly reduces the number of features used to characterize microarray data. We propose a general latent variable framework for predicting predefined classes of samples from gene expression profiles obtained in microarray experiments. The framework consists of (i) selection of a smaller number of genes that are most differentially expressed between samples, (ii) dimension reduction using hierarchical clustering, where each cluster partition is identified as a latent variable, (iii) discretization of the gene expression matrix, (iv) fitting the Rasch item response model to the genes in each cluster partition to estimate the expression of the latent variable, and (v) construction of a prediction model with the latent variables as covariates to study the relationship between latent variables and phenotype. Two different microarray data sets are used to illustrate the approach. We show that the predictive performance of our method is comparable to that of the current best approach based on an all-gene space. The method is general and can be applied to other high-dimensional data problems.
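Steps (iii) and (iv) of the framework can be illustrated with a minimal sketch. The abstract does not say how the expression matrix is discretized or how the Rasch model is estimated, so everything below is an assumption made for illustration: a per-gene median split for discretization, and joint maximum likelihood by gradient ascent with a small ridge penalty for estimation. The function names (`binarize_by_median`, `fit_rasch`), the toy data, and the tuning values are hypothetical. Gene selection, hierarchical clustering, and the final prediction model (steps i, ii, and v) are not shown; the four genes are simply treated as one cluster.

```python
import math
import statistics


def binarize_by_median(expr):
    """Discretize a genes x samples matrix into a samples x genes 0/1 matrix.

    Assumption: a gene is coded 1 in a sample when its expression is above
    that gene's median across samples. The source does not specify the rule.
    """
    medians = [statistics.median(gene) for gene in expr]
    n_samples = len(expr[0])
    return [
        [1 if expr[j][i] > medians[j] else 0 for j in range(len(expr))]
        for i in range(n_samples)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(X, n_iter=2000, lr=0.5, ridge=0.01):
    """Fit the Rasch model P(X[i][j] = 1) = sigmoid(theta[i] - b[j]).

    theta[i] is the latent expression level of sample i and b[j] is the
    difficulty of gene j. Estimation is joint maximum likelihood by scaled
    gradient ascent; the ridge term (an assumption) keeps estimates finite
    for samples whose genes are all 0 or all 1.
    """
    n, m = len(X), len(X[0])
    theta = [0.0] * n
    b = [0.0] * m
    for _ in range(n_iter):
        P = [[sigmoid(theta[i] - b[j]) for j in range(m)] for i in range(n)]
        # Gradient of the penalized log-likelihood with respect to theta and b.
        g_theta = [
            sum(X[i][j] - P[i][j] for j in range(m)) - ridge * theta[i]
            for i in range(n)
        ]
        g_b = [
            sum(P[i][j] - X[i][j] for i in range(n)) - ridge * b[j]
            for j in range(m)
        ]
        theta = [t + lr * g / m for t, g in zip(theta, g_theta)]
        b = [v + lr * g / n for v, g in zip(b, g_b)]
        # Fix the location of the scale: center difficulties at zero and
        # shift theta by the same amount so theta[i] - b[j] is unchanged.
        mean_b = sum(b) / m
        b = [v - mean_b for v in b]
        theta = [t - mean_b for t in theta]
    return theta, b


# Toy data: 4 genes (rows) measured in 6 samples (columns), one cluster.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.5, 1.0, 2.5, 5.0, 4.0, 6.5],
    [2.0, 1.0, 6.0, 3.0, 5.0, 7.0],
    [0.5, 3.0, 1.0, 2.0, 4.0, 5.0],
]
X = binarize_by_median(expr)
theta, b = fit_rasch(X)
```

In this sketch, `theta` holds one latent score per sample for the cluster. Under the framework described above, one such score per cluster would then serve as a covariate in the prediction model of step (v).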
Similar resources
Integration and Reduction of Microarray Gene Expressions Using an Information Theory Approach
The DNA microarray is an important technique that allows researchers to analyze a large amount of gene expression data in parallel. Although the data can be more significant if they come from separate experiments, one of the most challenging phases in the microarray context is the integration of separate expression level datasets that have been gathered through different techniques. In this paper, we prese...
Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine
DNA microarray gene expression data provide a wealth of information on thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and the differences between tumors. In this study we reduce the high-dimensional data with a statistical method to select high-impact genes as biomarkers and then classify ovarian tumors based on gene expression data of...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in the prevention, diagnosis and treatment of diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Gene Identification from Microarray Data for Diagnosis of Acute Myeloid and Lymphoblastic Leukemia Using a Sparse Gene Selection Method
Background: Microarray experiments can simultaneously determine the expression of thousands of genes. Identification of potential genes from microarray data for diagnosis of cancer is important. This study aimed to identify genes for the diagnosis of acute myeloid and lymphoblastic leukemia using a sparse feature selection method. Materials and Methods: In this descriptive study, the expressio...
Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest
Background & objective: Microarray and next generation sequencing (NGS) data are important sources for finding useful molecular patterns. However, the large volume of gene expression data makes it harder to identify the biomarkers associated with cancer. The random forest (RF) is used to effectively analyze the problems of large-p and smal...